DETAILED ACTION

Response to Amendment 

1. Claims 1-19 are pending. 

2. Claims 1,17, and 18 have been amended. 

3. The previous rejections under 35 U.S.C. 112, 1st paragraph have been withdrawn in
view of the amendments received 7/22/2009. 

4. The previous rejections under 35 U.S.C. 101 have been withdrawn in view of the
amendments received 7/22/2009. 



Response to Arguments 

5. Applicant argues "Even though there may be "human assistance" in the 
generation of Chou's network of key-phrase and filler-phrase grammars, there is no 
suggestion in Chou that such "human assistance" represents an "indication that a 
spoken event in a first set of audio signals is of interest to a user." (Remarks, Page 8,
¶ 3) The arguments are directed to claim language that has been amended since the last
action and therefore the limitations are not recited in the rejected claims. Although the 
claims are interpreted in light of the specification, limitations from the specification are 
not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. 
Cir. 1993). 

6. Applicant further argues "Foote does not disclose or make obvious "locating ... 
putative instances of the spoken event of interest in the second audio signal" using a 
representation of the spoken event of interest that is formed by "receiving an indication 
that a spoken event in a first set of audio signals is of interest to a user, identifying two
or more instances of the spoken event of interest in the first set of audio signals, and 
representing each identified instance of the spoken event of interest in the 
representation of the spoken event of interest using at least one sequence of subword 
units," as required in amended claim 1." (Remarks, Page 10, ¶ 2) The arguments are
directed to claim language that has been amended since the last action and therefore 
the limitations are not recited in the rejected claims. Although the claims are interpreted 
in light of the specification, limitations from the specification are not read into the claims. 
See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). 

Claim Objections 

7. Claim 13 is objected to because of the following informalities: There should be a 
space between "claim" and "1". Appropriate correction is required. 

Claim Rejections - 35 USC §112 

The following is a quotation of the second paragraph of 35 U.S.C. 112:

The specification shall conclude with one or more claims particularly pointing out and distinctly 
claiming the subject matter which the applicant regards as his invention. 

8. Claims 17-19 recite the limitation "unknown speech". There is insufficient 
antecedent basis for this limitation in the claim. The claim amendments to claim 1 
appear to remove this term but it was overlooked in claims 17-19. Appropriate 
correction is required. 

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains. 
Patentability shall not be negatived by the manner in which the invention was made. 

9. Claims 1-4, 8-9, 12-19 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Cardillo et al. (NPL Document "Phonetic Searching vs. LVCSR: How
to Find What You Really Want in Audio Archives") in view of Wolf et al. (US PGPUB
#20030204492).

As per claim 1, Cardillo teaches:

accepting, by a word spotting engine of the word spotting system, data 
representing the unknown speech in a second audio signal; and (Fig. 1,
preprocessing phase and generation of the search track. Page 11, column 2, ...A 
phonetic grammar likewise depends upon the natural language in use (particularly the 
set of phonemes used to represent basic sounds and meanings of the input speech). 
This grammar is used to identify likely end points of words in the input speech...) 

Cardillo fails to fully teach, but Wolf teaches: 

forming, by a query recognizer of a word spotting system, a representation of a 
spoken event of interest, wherein the forming includes receiving an indication that a 
spoken event in a first set of audio signals is of interest to a user, identifying two or
more instances of the spoken event of interest in the first set of audio signals, and 
representing each identified instance of the spoken event of interest in the 
representation of the spoken event of interest using at least one sequence of subword 
units; (Cardillo, Page 15, column 2, teaches multiple word queries which are two
or more instances of an event of interest. They are performed in a phonetic search;
therefore they comprise at least one sequence of phonemes each. Again, Cardillo fails
to specifically teach a spoken event of interest. Wolf, ¶ 0054, ...a phoneme lattice, as
described above, can also be used for devices with limited resources... In the case 
where the recognizer is part of the input device, e.g., a cell phone, the lattices can be 
forwarded to the search engine 190...) 

locating, by the word spotting engine of the word spotting system, putative 
instances of the spoken event of interest in the second audio signal using the 
representation of the spoken event of interest, wherein the locating includes identifying 
time locations of the second audio signal at which the spoken event of interest is likely 
to have occurred based on a comparison of the data representing the unknown speech 
with the representation of the spoken event of interest. (Cardillo teaches in
Fig. 1 a searching phase which locates putative instances of a query, where Page 12,
Temporal_Offset teaches a time location. Cardillo fails to specifically teach a spoken
event of interest (spoken query). Wolf, abstract, ...A spoken query is represented as a 
lattice... ) 

It would have been obvious to someone of ordinary skill in the art to combine
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a
speech recognition process and performing traditional textual document retrieval. (Wolf,
¶ 0006)

As per claim 2, claim 1 is incorporated and Cardillo teaches: 

wherein forming the representation of the spoken event of interest comprises 
applying a computer-implemented speech recognition algorithm to data representing
the first set of audio signals. (Page 11, Preprocessing, Acoustic Model, and
Phonetic Grammar, ...preprocessing engine scans the input speech and produces the
corresponding phonetic search track...) 

As per claim 3, claim 1 is incorporated and Cardillo teaches: 

wherein the subword units include linguistic units. (Page 11, Preprocessing,
Acoustic Model, and Phonetic Grammar, ...preprocessing engine scans the input 
speech and produces the corresponding phonetic search track...) 

As per claim 4, claim 2 is incorporated and Cardillo fails to fully teach but Wolf teaches: 

wherein locating the putative instances includes applying a computer-implemented
word spotting algorithm configured using the representation of the spoken
event of interest. (Cardillo teaches phonetic keyword spotting but fails to teach
that the event of interest (query) is spoken. Wolf, ¶ 0054, ...a phoneme lattice, as
described above, can also be used for devices with limited resources... In the case
where the recognizer is part of the input device, e.g., a cell phone, the lattices can be 
forwarded to the search engine 190...) 

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 8, claim 1 is incorporated and Cardillo fails to fully teach but Wolf teaches: 

wherein the representation of the spoken event of interest defines a network of 
subword units. (Wolf, ¶ 0054, ...a phoneme lattice, as described above, can
also be used for devices with limited resources... In the case where the recognizer is
part of the input device, e.g., a cell phone, the lattices can be forwarded to the search 
engine 190...) 

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 9, claim 8 is incorporated and Cardillo fails to fully teach but Wolf teaches: 

wherein the network of subword units is formed by multiple sequences of 
subword units that correspond to different paths through the network. (Wolf,
¶ 0054, ...a phoneme lattice, as described above, can also be used for devices with
limited resources... In the case where the recognizer is part of the input device, e.g., a 
cell phone, the lattices can be forwarded to the search engine 190... Figs 3a and 3b 
show the lattices.) 

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 12, claim 1 is incorporated and Cardillo fails to fully teach but Wolf 
teaches: 

accepting first audio data representing utterances of the event of interest spoken 
by a user, and processing the first audio data to form a processed query. 
(Wolf, ¶ 0054, ...a phoneme lattice, as described above, can also be used for devices
with limited resources... In the case where the recognizer is part of the input device,
e.g., a cell phone, the lattices can be forwarded to the search engine 190... Figs 3a and 
3b show the lattices.) 

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

As per claim 13, claim 1 is incorporated and Cardillo teaches:

accepting a selection by the user of portions of stored data from the first set of 
audio signals, and processing the portions of the stored data to form a processed query. 
(Cardillo, Page 14, teaches the use of the AudioLogger™ product for selecting media 
segments. Although the capabilities of AudioLogger™ are not discussed in Cardillo, 
NPL Document "Inventory of Metadata for Multimedia" teaches in section 3.4 that ...The 
Virage Tools automatically index your video content. The tools use speech recognition 
technology to extract information from the audio signal... Therefore, the selection is an
inherent feature of AudioLogger™, which Cardillo incorporates by reference.)

As per claim 14, claim 13 is incorporated and Cardillo teaches: 

prior to accepting the selection by the user, processing the first set of audio 
signals according to a first computer-implemented speech recognition algorithm to 
produce the stored data. (Cardillo, Page 14, teaches the use of the
AudioLogger™ product for selecting media segments. Although the capabilities of
AudioLogger™ are not discussed in Cardillo, NPL Document "Inventory of Metadata for 
Multimedia" teaches in section 3.4 that ...The Virage Tools automatically index your 
video content. The tools use speech recognition technology to extract information from 
the audio signal. A publishing application makes it possible to publish the searchable 
information... Therefore, prior to the selection of the searchable information, a speech 
recognition algorithm is performed to produce the stored searchable data. The
selection is an inherent feature of AudioLogger™, which Cardillo incorporates by
reference.)

As per claim 15, claim 14 is incorporated and Cardillo teaches:

wherein the first speech recognition algorithm produces data related to presence 
of the subword units at different times in the first set of audio signals. (Cardillo
teaches in Fig. 1 a searching phase which locates putative instances of a query, where
Page 12, Temporal_Offset teaches a time location. Thus the collection of subword units
generated by the first speech recognition algorithm is indicative of subword units at 
different times in the first set of audio signals.) 

As per claim 16, claim 14 is incorporated and Cardillo teaches: 

applying a second speech recognition algorithm to the processed query. 
(Cardillo, Page 11, column 2, ...It is even conceivable that dynamic channel
and language detection could be employed to switch acoustic models during 
preprocessing... Furthermore, Page 12, column 1, ...A phonetic dictionary is probed for 
each word within the query term to accommodate unusual words (whose pronunciations 
must be handled specially for the given natural language)... It would have been obvious
to someone of ordinary skill in the art at the time of the invention that if the audio is 
indexed in an alternative language and a person wants to search in a different language 
(North American English vs. Castilian Spanish), a second speech recognition algorithm 
with a phonetic dictionary specific to the language could have been applied to the 
processed query.) 

Claims 17 and 18 are the computer readable medium and hardware
representations of the method as claimed in claim 1. Claims 17 and 18 are rejected
under the same principles as claim 1 for having similar limitations. Cardillo, Page 20, 
table 3 teaches computer implemented means for the method. The computer teaches a 
system, and the system cannot operate without being programmed, so a computer 
readable medium is inherent. 

As per claim 19, claim 18 is incorporated and Cardillo fails to specifically teach, but Wolf 
teaches: 

wherein the word spotter is further configured to identify time locations of the 
second audio signal at which the spoken event of interest is likely to have occurred 
based on a comparison of the data representing the unknown speech with the 
representation of the spoken event of interest. (Cardillo teaches in Fig. 1 a
searching phase which locates putative instances of a query, where Page 12,
Temporal_Offset teaches a time location. Cardillo fails to specifically teach a spoken 
event of interest (spoken query). Wolf, abstract, ...A spoken query is represented as a 
lattice...) 

It would have been obvious to someone of ordinary skill in the art to combine 
Wolf with Cardillo to avoid losing information and adding ambiguity by application of a 
speech recognition process and performing traditional textual document retrieval. (Wolf, 
¶ 0006)

10. Claims 5-7, 10-11 are rejected under 35 U.S.C. 103(a) as being unpatentable
over Cardillo et al. (NPL Document "Phonetic Searching vs. LVCSR: How to Find What
You Really Want in Audio Archives") in view of Wolf et al. (US PGPUB #20030204492)
in view of Ferrieux et al. (NPL Document "PHONEME-LEVEL INDEXING FOR FAST 
AND VOCABULARY-INDEPENDENT VOICE/VOICE RETRIEVAL") 

As per claim 5, claim 4 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
teaches: 

selecting processing parameter values of the speech recognition algorithm for 
application to the data representing the first set of audio signals according to 
characteristics of the word spotting algorithm. (Section 2.2, ...To compute the
optimal parameters of those models...)

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, section 2.2)

As per claim 6, claim 5 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
teaches: 

wherein the selecting of the processing parameter values of the speech 
recognition algorithm includes optimizing said parameters according to an accuracy of 
the word spotting algorithm. (Section 2.2, ... To compute the optimal 
parameters of those models, ...maximize likelihood...)

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, section 2.2)

As per claim 7, claim 5 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
teaches: 

wherein the selecting of the processing parameter values of the speech 
recognition algorithm includes selecting values for parameters including one or more of 
an insertion factor, a recognition search beam width, a recognition grammar factor, and 
a number of recognition hypotheses. (Ferrieux, section 2.1 defines the models
which include insertion costs, section 2.2 optimizes them.)

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, section 2.2)

As per claim 10, claim 1 is incorporated and Cardillo and Wolf fail to teach, but Ferrieux 
suggests: 

wherein forming the representation of the spoken event of interest includes 
determining an n-best list of recognition results. (Section 3, Synchronized 
Lattices, ...for each phoneme of the 1-best sequence, the posterior probabilities of the
N-1 other phonemes on the same interval are also given... Ferrieux further provides 
...the single best sequence of phonemes often contains errors, and although systematic 
errors are handled by the confusion matrix, it is felt that knowledge of the second-best 
match would most often yield greater precision... Therefore, given that the single best 
sequence of phonemes often contains errors, it would have been obvious to someone 
of ordinary skill in the art that multiple sequences through the lattice would have been 
beneficial to boost precision due to a higher number of sequences.) 

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, section 2.2)

As per claim 11, claim 10 is incorporated and Cardillo and Wolf fail to teach, but
Ferrieux suggests: 

wherein each sequence of subword units in the representation corresponds to a 
different one in the n-best list of recognition results. (Section 3, Synchronized 
Lattices, ...for each phoneme of the 1-best sequence, the posterior probabilities of the
N-1 other phonemes on the same interval are also given... Ferrieux further provides 
...the single best sequence of phonemes often contains errors, and although systematic 
errors are handled by the confusion matrix, it is felt that knowledge of the second-best 
match would most often yield greater precision... Therefore, given that the single best
sequence of phonemes often contains errors, it would have been obvious to someone
of ordinary skill in the art that multiple sequences through the lattice would have been 
beneficial to boost precision due to a higher number of sequences. The phoneme 
sequences are the matches (which defines the first and second options) in Ferrieux, 
therefore it would have further been obvious that the phoneme sequences through the 
lattice would have been each of the different options in the n-best criterion evaluations.) 

It would have been obvious to someone of ordinary skill in the art at the time of 
the invention to combine Ferrieux with Cardillo and Wolf to optimize the parameters of 
the system based on the desired outcome (maximize likelihood/minimize error retrieval 
rate). (Ferrieux, section 2.2)

Conclusion 

11. The prior art made of record and not relied upon is considered pertinent to 
applicant's disclosure. Refer to PTO-892, Notice of References Cited for a listing of 
analogous art. 

Applicant's amendment necessitated the new ground(s) of rejection presented in 
this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP
§ 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 
CFR 1.136(a). 

A shortened statutory period for reply to this final action is set to expire THREE 
MONTHS from the mailing date of this action. In the event a first reply is filed within 
TWO MONTHS of the mailing date of this final action and the advisory action is not 
mailed until after the end of the THREE-MONTH shortened statutory period, then the
shortened statutory period will expire on the date the advisory action is mailed, and any 
extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of
the advisory action. In no event, however, will the statutory period for reply expire later 
than SIX MONTHS from the date of this final action. 

12. Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to GREG A. BORSETTI whose telephone number is 
(571)270-3885. The examiner can normally be reached on Monday - Thursday (8am - 
5pm Eastern Time). 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, RICHEMOND DORVIL can be reached on 571-272-7602. The fax phone 
number for the organization where this application or proceeding is assigned is
571-273-8300.

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 
USPTO Customer Service Representative or access to the automated information 
system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000. 